library(tidyverse)
library(ggplot2)
library(plotly)
library(here)
#setwd(here('GitHub', 'CUNY_MSDA', 'Fall_2017', 'DATA_606', 'Project Proposal'))
#
# Load the files from the working directory
#
amendments_raw <- read.csv("amendment_list.csv")
members_raw <- read.csv("congress_terms.csv")
#
# Remove empty column, and remove all rows with missing data
#
bills <- read.csv("bills93-114.csv", header = TRUE, na.strings = c('', 'NULL')) %>%
  select(-7)
bills <- bills[complete.cases(bills), ]
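As a minimal sketch of what the cleaning step above does, the toy CSV below shows how `na.strings` turns empty fields and the literal text `NULL` into `NA`, and how `complete.cases()` then drops those rows. The values and column choices are invented for illustration and are not taken from bills93-114.csv.

```r
# Toy CSV with invented values; the columns are illustrative only.
csv_text <- "BillID,Age,PLaw
1,54,0
2,,1
3,NULL,0
4,61,1"

toy <- read.csv(text = csv_text, header = TRUE, na.strings = c("", "NULL"))

# Rows 2 and 3 have a missing Age, so only rows 1 and 4 survive.
toy_clean <- toy[complete.cases(toy), ]
print(toy_clean)
```

Because "NULL" is declared as a missing-value string, the Age column is still read as numeric rather than as text.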

Part 1: Introduction

Is there a relationship between the average age of congress (members) and the number of constitutional amendments proposed?

The average age of congressional representatives has been steadily climbing since the Second World War. The current (115th) Congress is among the oldest in its history. How has this affected the effectiveness of Congress? Are older representatives more or less active?

I plan to explore this by proxy, by looking at all the constitutional amendments proposed from the first Congress through the 113th and recording the age of each amendment's sponsors. Additionally, I will look for interesting tidbits in the data, such as the most active years and which states' representatives propose the most legislation.

Part 2: Data

Data collection

The amendment list was retrieved from Kaggle, while the members list was taken from FiveThirtyEight. Another source is the Wall Street Journal. For the analysis, I used a different dataset, retrieved from CongressionalBills.org.

The list of 11,000+ amendments was compiled by staff and volunteers of the National Archives and Records Administration. The list of representatives was compiled by The UnitedStates Project (House members) and The New York Times Congress API (Senate).

Cases

Each case represents a constitutional amendment proposed in Congress. There are a total of 11,797 cases in this dataset.

Variables

The response variable is legislative activity and is numerical.

The explanatory variable is median age of congressional representatives and is numerical.
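One way the two variables could be related, sketched under assumptions: compute one median age and one amendment count per congress, join them on the congress number, and look at the correlation. The data frames, column names, and all values below are hypothetical placeholders, not results from the datasets used in this report.

```r
# Invented per-congress values, for illustration only.
median_age <- data.frame(congress = 1:5,
                         median_age = c(50.1, 50.8, 51.5, 52.9, 53.4))
amendment_counts <- data.frame(congress = 1:5,
                               n_amendments = c(120, 135, 128, 160, 171))

# Join the explanatory and response variables on the congress number.
per_congress <- merge(median_age, amendment_counts, by = "congress")

# Pearson correlation between median age and legislative activity.
r <- cor(per_congress$median_age, per_congress$n_amendments)
print(r)
```

A correlation computed this way only describes association across congresses; it says nothing about why the two series move together.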

Type of study

This is an observational study.

Scope of inference - generalizability

The sample of bills is large enough that we can generalize the results to the overall 'population' of bills.

Scope of inference - causality

The data cannot be used to establish causal links, since it's only an observational study.

Part 3: Exploratory data analysis

#
# Tidy the datasets
#
# Keep only the relevant columns
amendments <- amendments_raw %>%
  select(5, 7:(ncol(amendments_raw) - 1))  # column 5, then 7 through the second-to-last

# Use regex to shift errant data to their appropriate columns
pat <- "\\D{3,}"  # three or more consecutive non-digit characters
for (i in seq_len(nrow(amendments))) {
  if (grepl(pat, amendments[i, "month"])) {
    amendments[i, "year"] <- amendments[i, "month"]
    amendments[i, "month"] <- amendments[i, "day"]
    amendments[i, "day"] <- amendments[i, "congress"]
    amendments[i, "congress"] <- amendments[i, "congressional_session"]
    amendments[i, "congressional_session"] <- amendments[i, "joint_resolution_chamber"]
  }
}
amendments$year <- gsub("\\D{4}$", "", amendments$year)

members <- members_raw %>%
  select(-c(3,7))
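The row-by-row shift above can also be written without a loop. The sketch below is one possible vectorized form, assuming the columns are character vectors rather than factors. The toy rows, including what the errant row looks like, are invented for illustration.

```r
# Toy data with invented values; the second row is shifted one column right.
toy <- data.frame(
  year     = c("1790", "of sex", "1801"),
  month    = c("3", "1923 (?)", "12"),
  day      = c("4", "12", "9"),
  congress = c("1", "10", "7"),
  stringsAsFactors = FALSE
)

cols <- c("year", "month", "day", "congress")

# Flag rows whose month holds three or more consecutive non-digits.
shifted <- grepl("\\D{3,}", toy$month)

# For flagged rows, each column takes the value of the column to its right.
# The right-hand side is evaluated before assignment, so nothing is overwritten early.
toy[shifted, cols[-length(cols)]] <- toy[shifted, cols[-1]]

# Strip four trailing non-digit characters left over in the year.
toy$year <- gsub("\\D{4}$", "", toy$year)
print(toy)
```

As in the loop, the last column in the list keeps its old value for shifted rows, so it would still need attention in a full cleanup.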

Let's take a look at what the data has to say.

#
# Which years had the most bills?
#

by_year_graph <- ggplot(amendments, aes(year)) +
  geom_bar() +
  scale_x_discrete(breaks=seq(1788, 2014, 20))
by_year_graph <- ggplotly(by_year_graph)
by_year_graph

There is a noticeable spike from the 1960s through the 1980s; my guess is that it is related to the civil rights movement. We can see the most common titles/descriptions of all proposed amendments:

head(summary(amendments$title_or_description_from_source))
##      Equal rights for men and women      Equal rights regardless of sex 
##                                 601                                 399 
##                Balancing the budget                       Right to vote 
##                                 288                                 246 
##            Prayer in public schools Apportionment of State legislatures 
##                                 241                                 208

Three of the top four most common amendments are indeed related to civil rights.
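Note that `summary()` only returns counts like those above when the column is a factor. Since R 4.0.0, `read.csv()` defaults to `stringsAsFactors = FALSE`, so the column may arrive as character. A minimal sketch of a count that works for either type, using invented titles:

```r
# Invented titles, for illustration only.
titles <- c("Right to vote", "Balancing the budget", "Right to vote",
            "Equal rights for men and women", "Right to vote",
            "Balancing the budget")

# table() counts each distinct value whether the input is character or factor;
# sorting in decreasing order puts the most common titles first.
title_counts <- sort(table(titles), decreasing = TRUE)
head(title_counts, 2)
```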

amendments <- amendments %>%
  filter(!sponsor_state_or_territory %in% "")
by_state_graph <- ggplot(amendments, aes(sponsor_state_or_territory)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90))
  #scale_x_discrete(breaks=seq(1788, 2014, 20))
by_state_graph <- ggplotly(by_state_graph)
by_state_graph

New York congressmen have proposed the most amendments, followed by those from Texas and California.

#
# Create new df of unique congressmen - remove duplicates
#
members_unique <- members %>%
  distinct(lastname, birthday, .keep_all = TRUE)

table(members_unique$party)
## 
##   AL    D    I   ID    L    R 
##    2 1662   11    2    1 1541
table(members_unique$incumbent)
## 
##   No  Yes 
## 2782  437
summary(members$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   45.40   53.00   53.31   60.55   98.10

The two major parties dominate, and are roughly even in number.

Since we removed duplicates, the table reports that most congressmen were not incumbents; if we were to include the duplicates, the incumbents column would surely be several times larger.
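The effect of de-duplication on the incumbent counts can be sketched with a toy table. The rows below are invented. The base R call keeps the first row per member, which is also what `distinct(lastname, birthday, .keep_all = TRUE)` does.

```r
# Invented members: A serves three terms, B serves one.
terms <- data.frame(
  lastname  = c("A", "A", "A", "B"),
  birthday  = c("1900-01-01", "1900-01-01", "1900-01-01", "1910-05-05"),
  incumbent = c("No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Keep only the first row for each (lastname, birthday) pair.
first_term <- terms[!duplicated(terms[c("lastname", "birthday")]), ]

table(terms$incumbent)       # counts over all terms
table(first_term$incumbent)  # counts with one row per member
```

Because a member's first recorded term usually comes before any re-election, keeping the first row per member discards most of the "Yes" rows.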

The average congress member is about 53 years old at the time of their inauguration. I'm not sure what's more surprising: that there was a congress member aged 25, or that there was a congress member aged 98!

age_year_graph <- ggplot(members, aes(congress, age)) +
  geom_point() +
  stat_summary(aes(group = 1), fun = mean, colour = "red", geom = "line")  # fun.y is deprecated since ggplot2 3.3.0
age_year_graph <- ggplotly(age_year_graph)
age_year_graph

Congress is definitely getting older. The average age of congress members in the 80th congress was \(\approx 52.5\) years old. In the 113th congress, the average age was \(\approx 57.6\) years old!
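The two averages quoted above differ by about 5.1 years. As a minimal sketch of how such per-congress means can be computed outside the plot, the snippet below uses `tapply()` on a toy table. The ages are invented and are not taken from the members dataset.

```r
# Invented ages, for illustration only.
toy_members <- data.frame(
  congress = c(80, 80, 80, 113, 113, 113),
  age      = c(48.0, 52.0, 56.0, 50.0, 58.0, 66.0)
)

# Mean age within each congress, the quantity traced by the red line.
mean_age <- tapply(toy_members$age, toy_members$congress, mean)

# Change in mean age between the two congresses in the toy data.
age_change <- unname(mean_age["113"] - mean_age["80"])
print(mean_age)
print(age_change)
```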

Part 4: Inference

\(H_0\): Each bill has an equal chance to pass, regardless of the sponsor's party, age, district, terms served, etc.

\(H_1\): Each bill does not have an equal chance, i.e. there are variables that can alter the probability of passing.

Independence: We can assume that the bills are independent of each other.

Sample size: Each sample has at least 5 cases.

formula <- 'PLaw ~ Age + ComC + CumHServ + District + Gender + Majority + Party + State'
model <- glm(formula= formula, data=bills, family=binomial(link="logit"))
summary(model)
## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = bills)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0223  -0.3716  -0.3069  -0.2514   3.0325  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.141365   0.245519 -16.868  < 2e-16 ***
## Age          0.017993   0.001471  12.235  < 2e-16 ***
## ComC         0.515103   0.037889  13.595  < 2e-16 ***
## CumHServ     0.018060   0.001880   9.606  < 2e-16 ***
## District    -0.002248   0.001098  -2.047 0.040650 *  
## Gender      -0.222755   0.069479  -3.206 0.001346 ** 
## Majority     0.692814   0.035549  19.489  < 2e-16 ***
## PartyR       0.672183   0.035726  18.815  < 2e-16 ***
## StateAL     -0.311598   0.244820  -1.273 0.203103    
## StateAR      0.285120   0.241857   1.179 0.238446    
## StateAZ     -0.177144   0.262651  -0.674 0.500028    
## StateCA     -0.461073   0.227616  -2.026 0.042799 *  
## StateCO     -0.063225   0.248109  -0.255 0.798856    
## StateCT     -0.817284   0.249150  -3.280 0.001037 ** 
## StateDE     -0.303827   0.323655  -0.939 0.347866    
## StateFL     -0.473599   0.237894  -1.991 0.046503 *  
## StateGA      0.068132   0.240360   0.283 0.776825    
## StateHI     -1.130915   0.304718  -3.711 0.000206 ***
## StateIA     -0.835370   0.256716  -3.254 0.001138 ** 
## StateID     -0.260772   0.278561  -0.936 0.349201    
## StateIL     -0.966694   0.234794  -4.117 3.84e-05 ***
## StateIN     -0.815532   0.251223  -3.246 0.001169 ** 
## StateKS     -0.828676   0.258334  -3.208 0.001338 ** 
## StateKY     -0.400810   0.246119  -1.629 0.103415    
## StateLA     -0.197388   0.241399  -0.818 0.413539    
## StateMA     -0.500389   0.234997  -2.129 0.033226 *  
## StateMD     -0.279738   0.238939  -1.171 0.241700    
## StateME     -0.490469   0.275579  -1.780 0.075113 .  
## StateMI     -0.695460   0.236053  -2.946 0.003217 ** 
## StateMN     -0.703081   0.254018  -2.768 0.005643 ** 
## StateMO     -0.356222   0.243423  -1.463 0.143361    
## StateMS     -0.263297   0.247054  -1.066 0.286538    
## StateMT      0.088535   0.248909   0.356 0.722070    
## StateNC      0.274487   0.239143   1.148 0.251053    
## StateND     -1.051264   0.290165  -3.623 0.000291 ***
## StateNE     -0.542104   0.269634  -2.011 0.044376 *  
## StateNH     -1.887889   0.446190  -4.231 2.33e-05 ***
## StateNJ     -0.894611   0.236750  -3.779 0.000158 ***
## StateNM      0.031731   0.251805   0.126 0.899721    
## StateNV     -0.857734   0.322405  -2.660 0.007804 ** 
## StateNY     -0.988787   0.226496  -4.366 1.27e-05 ***
## StateOH     -0.527559   0.234389  -2.251 0.024399 *  
## StateOK      0.235634   0.247035   0.954 0.340161    
## StateOR     -0.496259   0.256493  -1.935 0.053016 .  
## StatePA     -0.770831   0.230655  -3.342 0.000832 ***
## StateRI     -0.657776   0.273822  -2.402 0.016297 *  
## StateSC      0.136647   0.240183   0.569 0.569405    
## StateSD     -0.346970   0.264489  -1.312 0.189570    
## StateTN     -0.173729   0.242436  -0.717 0.473622    
## StateTX      0.071055   0.232054   0.306 0.759454    
## StateUT     -0.428544   0.287851  -1.489 0.136549    
## StateVA      0.062025   0.239214   0.259 0.795413    
## StateVT     -0.154349   0.366113  -0.422 0.673325    
## StateWA     -0.423932   0.243018  -1.744 0.081081 .  
## StateWI     -0.809815   0.246743  -3.282 0.001031 ** 
## StateWV     -0.709740   0.253026  -2.805 0.005032 ** 
## StateWY      0.180191   0.261022   0.690 0.489988    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 78617  on 173967  degrees of freedom
## Residual deviance: 74644  on 173911  degrees of freedom
## AIC: 74758
## 
## Number of Fisher Scoring iterations: 6
anova(model, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: PLaw
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                    173967      78617              
## Age       1  1291.38    173966      77326 < 2.2e-16 ***
## ComC      1   718.35    173965      76607 < 2.2e-16 ***
## CumHServ  1   183.60    173964      76424 < 2.2e-16 ***
## District  1    74.92    173963      76349 < 2.2e-16 ***
## Gender    1    22.46    173962      76326 2.144e-06 ***
## Majority  1   115.27    173961      76211 < 2.2e-16 ***
## Party     1   212.46    173960      75999 < 2.2e-16 ***
## State    49  1355.08    173911      74644 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
age_chart <- ggplot(bills, aes(x = Age)) +
  geom_bar()
ggplotly(age_chart)
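The coefficients in the model summary are on the log-odds scale, which is hard to read directly. A short sketch of converting a few of them to odds ratios follows. The estimates are copied from the summary output; the interpretation assumes that Majority is coded 0/1 and that Age is measured in years, neither of which is confirmed here.

```r
# Log-odds estimates copied from the model summary.
coefs <- c(Age = 0.017993, ComC = 0.515103, CumHServ = 0.018060,
           Majority = 0.692814, PartyR = 0.672183)

# Exponentiating a logit coefficient gives the multiplicative change in odds
# for a one-unit increase in that predictor, holding the others fixed.
odds_ratios <- exp(coefs)
round(odds_ratios, 3)

# Odds ratio for a ten-unit difference in sponsor age.
or_age_10 <- exp(10 * coefs[["Age"]])
round(or_age_10, 3)
```

Under those assumptions, each additional year of sponsor age multiplies the odds of a bill becoming law by roughly 1.018, a ten-year difference by roughly 1.2, and majority status roughly doubles the odds.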

Conclusion

The answer to our initial question, whether age plays a role in legislation passed, seems to vary with the type of legislation. Constitutional amendments tend to be proposed by older representatives, while overall, the introduction and sponsorship of laws leans toward younger ones.

There were some obvious findings in the data: the average age of congress members is climbing; the two major parties dominate in congress; most representatives are incumbents, i.e. most serve multiple terms; and representatives from the more populous states tend to write more laws.

But there were also some more surprising ones: most of the amendments were proposed in the civil rights era; representatives from states that introduce less legislation often get theirs signed into law; there are variables that can affect a bill's chances of getting signed into law.
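The last of these findings can be checked against the deviances reported in the model summary. The sketch below computes the overall drop in deviance and compares it with a chi-squared distribution. It is a standard likelihood-ratio style check, not a step from the original analysis. The four numbers are copied from the summary output.

```r
# Deviances and degrees of freedom copied from the model summary.
null_dev  <- 78617; null_df  <- 173967
resid_dev <- 74644; resid_df <- 173911

# Drop in deviance and the number of parameters that produced it.
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df

# Upper-tail chi-squared probability of a drop at least this large under H0.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p_value = p_value)
```

A drop of 3973 on 56 degrees of freedom is far larger than chance would produce under \(H_0\), which agrees with the sequential table from `anova()`.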

Unfortunately, it's difficult to find proper data on this subject. There are projects, both governmental and NGO-run, that are working toward 'opening' a lot of it, but they are still in their nascent stages and, thus far, only have information on the most recent congresses. There are a lot of gaps, and not many overlaps, in the data, which makes it practically impossible to do any predictive analysis. When these projects mature, and more data is available, that would be an interesting project to undertake.